Anomaly Detection

In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behaviour. Such examples may arouse suspicion of having been generated by a different mechanism, or appear inconsistent with the remainder of the data set. Anomaly detection finds application in many domains including cyber security, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud, to name only a few. Anomalies were originally sought so that they could be rejected or omitted from the data to aid statistical analysis, for example when computing the mean or standard deviation. They were also removed to improve the predictions of models such as linear regression, and more recently their removal has been used to improve the performance of machine learning algorithms. In many applications, however, the anomalies themselves are the observations of greatest interest in the entire data set, and they need to be identified and separated from noise or irrelevant outliers.

Three broad categories of anomaly detection techniques exist. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier. This approach is rarely used in anomaly detection, however, due to the general unavailability of labelled data and the inherently unbalanced nature of the classes. Semi-supervised anomaly detection techniques assume that some portion of the data is labelled. This may be any combination of normal or anomalous data, but more often than not the techniques construct a model representing normal behaviour from a given "normal" training data set and then test how likely it is that a test instance was generated by the model. Unsupervised anomaly detection techniques assume the data is unlabelled; they are by far the most commonly used because of their wider applicability.


Definition

Many attempts have been made in the statistical and computer science communities to define an anomaly. The most prevalent definitions include:
* An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.
* Anomalies are instances or collections of data that occur very rarely in the data set and whose features differ significantly from most of the data.
* An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.
* An anomaly is a point or collection of points that is relatively distant from other points in the multi-dimensional space of features.
* Anomalies are patterns in data that do not conform to a well defined notion of normal behaviour.
* Let T be a set of observations from a univariate Gaussian distribution and O a point from T. Then O is an outlier if and only if the z-score of O is greater than a pre-selected threshold (a minimal sketch follows this list).
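
The last definition above lends itself to a direct implementation. The following is a minimal sketch assuming only NumPy; the threshold of 3 standard deviations, the synthetic data and the helper name zscore_outliers are illustrative choices, not part of the cited definition.

import numpy as np

def zscore_outliers(values, threshold=3.0):
    # Flag points whose absolute z-score exceeds a pre-selected threshold.
    values = np.asarray(values, dtype=float)
    z = np.abs(values - values.mean()) / values.std()
    return z > threshold

rng = np.random.default_rng(0)
data = np.append(rng.normal(10.0, 0.2, 50), 42.0)   # 50 typical points plus one anomaly
print(np.flatnonzero(zscore_outliers(data)))         # expected: index 50 (the value 42.0)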


Applications

Anomaly detection is applicable in a very large number and variety of domains, and is an important subarea of unsupervised machine learning. It has applications in cyber-security intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecosystem disturbances, defect detection in images using machine vision, medical diagnosis and law enforcement.

Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986. Anomaly detection for IDS is normally accomplished with thresholds and statistics, but can also be done with soft computing and inductive learning. Types of statistics proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs based on frequencies, means, variances, covariances, and standard deviations. The counterpart of anomaly detection in intrusion detection is misuse detection.

Anomaly detection is also often used in preprocessing to remove anomalous data from a dataset. This is done for a number of reasons. Statistics such as the mean and standard deviation are more accurate after the removal of anomalies, and the visualisation of the data can also be improved. In supervised learning, removing anomalous data from the dataset often results in a statistically significant increase in accuracy. Finally, anomalies are often the most important observations to find in the data, such as intrusions in network traffic or abnormalities in medical images.
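
To illustrate the preprocessing use described above, here is a small sketch assuming only NumPy that removes points outside Tukey's fences before computing summary statistics; the fence factor k = 1.5, the toy data and the function name are illustrative assumptions.

import numpy as np

def remove_tukey_outliers(values, k=1.5):
    # Keep only points inside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return values[(values >= q1 - k * iqr) & (values <= q3 + k * iqr)]

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 42.0])
print(data.mean(), data.std())        # mean and standard deviation distorted by the 42.0
cleaned = remove_tukey_outliers(data)
print(cleaned.mean(), cleaned.std())  # statistics much closer to the bulk of the data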


Popular techniques

Many anomaly detection techniques have been proposed in the literature. Some of the popular techniques are:
* Statistical methods (Z-score, Tukey's range test and Grubbs's test)
* Density-based techniques (k-nearest neighbor, local outlier factor, isolation forest, and many more variations of this concept)
* Subspace-, correlation-based and tensor-based outlier detection for high-dimensional data
* One-class support vector machines
* Replicator neural networks, autoencoders, variational autoencoders, long short-term memory neural networks
* Bayesian networks
* Hidden Markov models (HMMs)
* Minimum Covariance Determinant
* Cluster analysis-based outlier detection
* Deviations from association rules and frequent itemsets
* Fuzzy logic-based outlier detection
* Ensemble techniques, using feature bagging, score normalization and different sources of diversity

The performance of these methods depends on the data set and parameters, and no method shows a systematic advantage over the others when compared across many data sets and parameters.
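
As a sketch of how two of the density-based techniques above can be applied in practice, the following compares an isolation forest with the local outlier factor using scikit-learn (listed under Software below); the synthetic two-cluster data and the contamination level of 5% are arbitrary assumptions made for illustration.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Two clusters of "normal" points plus a few scattered anomalies.
normal = np.concatenate([rng.normal(0.0, 0.5, (100, 2)),
                         rng.normal(5.0, 0.5, (100, 2))])
anomalies = rng.uniform(-4.0, 9.0, (10, 2))
X = np.concatenate([normal, anomalies])

# Isolation forest: anomalies are easier to isolate with random splits.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
iso_labels = iso.predict(X)                      # +1 = inlier, -1 = outlier

# Local outlier factor: compares each point's local density with that of its neighbours.
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X)

print("isolation forest flagged:", int((iso_labels == -1).sum()))
print("local outlier factor flagged:", int((lof_labels == -1).sum()))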


Software

* ELKI is an open-source Java data mining toolkit that contains several anomaly detection algorithms, as well as index acceleration for them.
* PyOD is an open-source Python library developed specifically for anomaly detection.
* scikit-learn is an open-source Python library with built-in functionality for unsupervised anomaly detection.
* Wolfram Mathematica provides functionality for unsupervised anomaly detection across multiple data types (see the Mathematica documentation).
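
As an example of the scikit-learn entry above, this sketch fits a Minimum Covariance Determinant model (one of the techniques listed earlier) using sklearn.covariance.EllipticEnvelope; the training data, test points and contamination value are illustrative assumptions, not a prescribed usage.

import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, (200, 2))      # data assumed to be mostly "normal"
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])  # one typical point, one far-away point

# EllipticEnvelope fits a robust Gaussian (Minimum Covariance Determinant) to the data
# and flags points that fall outside the fitted envelope.
detector = EllipticEnvelope(contamination=0.02, random_state=1).fit(X_train)
print(detector.predict(X_test))               # +1 for inliers, -1 for outliers; expect [ 1 -1]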


Datasets


* Anomaly detection benchmark data repository with carefully chosen data sets of the Ludwig-Maximilians-Universität München (mirror at the University of São Paulo).
* ODDS – a large collection of publicly available outlier detection datasets with ground truth in different domains.
* Unsupervised Anomaly Detection Benchmark at Harvard Dataverse: datasets for unsupervised anomaly detection with ground truth.
* KMASH Data Repository at Research Data Australia, with more than 12,000 anomaly detection datasets with ground truth.


See also

* Change detection
* Statistical process control
* Novelty detection
* Hierarchical temporal memory

